Three days ago I was looking through the GTDB-Tk manual and saw that
de_novo_wf was an option for creating the trees.
From the description given:
knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")
I believed this was something I should try, as it might produce
more accurate trees. Sample 1Dt2d (Enterobacter
cancerogenus) had been placed in the genus Pantoea by classify_wf in the
previous GTDB-Tk analysis, which led me to
this search. After a bit of trial and error, I produced this
script.
This ran as a Slurm job on Hawk (SCW) from roughly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that I experimented with were:
- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10
I settled on these as being the “best”; however, it is entirely possible that they could be optimised further.
This analysis produced these files:
/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log
./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv
./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv
./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results
./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree
I then opened this gtdbtk.bac120.decorated.tree file in
Dendroscope for review. All 10 samples are on one tree, but 1Dt2d
is still being placed in the “wrong” genus. I then looked up its sister
accession in the NCBI database.
The NCBI page for the sister accession reports a CheckM analysis with:
completeness: 90%
contamination: 3.6%
Taxonomy check status: failed
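Upon viewing the tree in Dendroscope, the node joining 1Dt2d and its sister
accession has the label 0.968. I believe this is a measure of confidence that the
relationship is correct; since the tree was inferred with FastTree (see
gtdbtk.bac120.fasttree.log), it is most likely a local support value rather
than a strict probability. This suggests the two are very closely related,
possibly the same species, and the
online sample is also identified as Enterobacter cancerogenus.
However, given the CheckM analysis and the failed taxonomy check on NCBI, I find it plausible that they both
have been misidentified and are in reality Pantoea species; I find this
the most parsimonious explanation. I will follow this up with a CheckM
analysis of my own on 1Dt2d.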
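To check the support label without clicking through Dendroscope, the joining node can be read straight from the Newick text. This is a minimal sketch rather than the method used above: it assumes internal nodes in the decorated tree are labelled like '0.968:g__Pantoea' (support first, then taxon), and the toy tree and the accession GCF_000000000.1 are made-up placeholders. For the full GTDB-Tk tree a dedicated library would likely be more robust, since this recursive parser could hit Python's recursion limit on very deep trees.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with label, length and children."""
    text = text.strip()
    pos = 0

    def parse_label():
        nonlocal pos
        # Quoted labels may contain ":" and ";", so read up to the closing quote.
        if pos < len(text) and text[pos] == "'":
            end = text.index("'", pos + 1)
            label = text[pos + 1:end]
            pos = end + 1
            return label
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
        label = parse_label()
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"label": label, "length": length, "children": children}

    return parse_node()


def parent_of_leaf(node, leaf_label):
    """Return the internal node whose direct child is the named leaf, or None."""
    for child in node["children"]:
        if not child["children"] and child["label"] == leaf_label:
            return node
        found = parent_of_leaf(child, leaf_label)
        if found is not None:
            return found
    return None


def leaf_names(node):
    """List the leaf labels below a node, in tree order."""
    if not node["children"]:
        return [node["label"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


def support_of(node):
    """Read the leading support value from a label like '0.968:g__Pantoea'."""
    try:
        return float(node["label"].split(":")[0])
    except ValueError:
        return None  # label holds no numeric support


# Toy tree with placeholder names, not real GTDB-Tk output.
toy = "((1Dt2d:0.01,GCF_000000000.1:0.02)'0.968:g__Pantoea':0.1,other:0.3);"
joining = parent_of_leaf(parse_newick(toy), "1Dt2d")
print(support_of(joining), leaf_names(joining))
```

Reading the label this way only reports what is written in the tree file; how the value should be interpreted still depends on how the tree was inferred.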
This was a “technical spike”, or proof of concept, for
de_novo_wf.
I wanted to see if the outputs of CheckM differed from those of CheckM2, so I ran CheckM on Hawk. I also began recreating the initial metadata table for the Bangor samples; in the spirit of automation, I chose a less manual approach this time around.
Using this
script, I ran a Slurm job on Hawk using the lineage_wf
of CheckM for all 10 Bangor-made samples; this took just 4 minutes. I
also worked on exporting the data I want to tabulate off Hawk.
Previously I did this by manually opening each file
and noting down the important characteristics. However, because there
are going to be more samples (and I wanted to be clever), I decided to use
a more automated process. I identified
files in the Flye directories on Hawk called
“assembly_info.txt”, which contain the same information but are far
easier to export. These are stored here:
/scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/. Using this
script, I exported them off Hawk.
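The collection step can be sketched roughly as follows. This is not the actual export script, only an illustration under stated assumptions: the function name collect_assembly_info and the output naming scheme [accession]_assembly_info.txt are hypothetical, and it only gathers the files into one directory on the cluster; the transfer to a local machine would be a separate step.

```python
from pathlib import Path
import shutil


def collect_assembly_info(flye_root, dest_dir):
    """Copy each flye_asm_<accession>/assembly_info.txt into dest_dir,
    prefixing the file name with the accession taken from the directory name."""
    flye_root = Path(flye_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for run_dir in sorted(flye_root.glob("flye_asm_*")):
        info = run_dir / "assembly_info.txt"
        if not info.is_file():
            continue  # skip runs that did not finish or have no info file
        accession = run_dir.name[len("flye_asm_"):]
        # Hypothetical naming scheme: keeps the accession attached to the file.
        target = dest_dir / f"{accession}_assembly_info.txt"
        shutil.copy2(info, target)
        copied.append(target)
    return copied
```

It could be called with the Flye output directory and any destination, for example collect_assembly_info("/scratch/scw2160/02_outputs/flye_asm", Path.home() / "assembly_info_export"), where the destination name is again only a placeholder.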
The CheckM analysis produced this output directory. My export script copied all 10 “assembly_info.txt” files to a directory in my home directory and added the accession ID to each file name; this is important, as otherwise I wouldn't know which file belonged to which accession. I then downloaded them and stored them here.
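Once the files are local, turning each one into a table row might look like the sketch below. It relies on Flye's assembly_info.txt being tab-separated with a "#"-prefixed header and the columns seq_name, length, cov. and circ. first. The choice of summary fields and the file naming [accession]_assembly_info.txt are my assumptions for illustration, not the characteristics recorded in the earlier manual table.

```python
import csv
from pathlib import Path


def summarise_assembly_info(path):
    """Summarise one Flye assembly_info.txt file as a single table row."""
    path = Path(path)
    contigs = []
    with path.open(newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and the header line
            contigs.append({
                "name": row[0],
                "length": int(row[1]),
                "coverage": float(row[2]),
                "circular": row[3] == "Y",
            })
    return {
        # Assumes files are named <accession>_assembly_info.txt (hypothetical).
        "accession": path.name.replace("_assembly_info.txt", ""),
        "n_contigs": len(contigs),
        "total_length": sum(c["length"] for c in contigs),
        "longest_contig": max((c["length"] for c in contigs), default=0),
        "n_circular": sum(c["circular"] for c in contigs),
    }


def write_summary_table(info_dir, out_csv):
    """Write one summary row per assembly_info file found in info_dir."""
    rows = [summarise_assembly_info(p)
            for p in sorted(Path(info_dir).glob("*_assembly_info.txt"))]
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=[
            "accession", "n_contigs", "total_length",
            "longest_contig", "n_circular"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Keeping the accession in the file name means the table can be rebuilt from the files alone when more samples are added.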
With it being Christmas, I did not take a serious look at the significance of either set of outputs, so that is what I plan to do next, hopefully with some conclusions by the end of tomorrow. In conclusion, this process is only half done and will continue into the following entry (or entries).